Utilising identifier error variation in linkage of large administrative data sources

Authors

Katie Harron

Gareth Hagger-Johnson

Ruth Gilbert

Harvey Goldstein

Abstract

BACKGROUND Linkage of administrative data sources often relies on probabilistic methods using a set of common identifiers (e.g. sex, date of birth, postcode). Variation in data quality at an individual or organisational level (e.g. by hospital) can result in clustering of identifier errors, violating the assumption of independence between identifiers required for traditional probabilistic match weight estimation. This potentially introduces selection bias into the resulting linked dataset. We aimed to measure variation in identifier error rates in a large English administrative data source (Hospital Episode Statistics; HES) and to incorporate this information into match weight calculation.

METHODS We used 30,000 randomly selected HES hospital admission records of patients aged 0-1, 5-6 and 18-19 years, for 2011/2012, linked via NHS number with data from the Personal Demographic Service (PDS; our gold standard). We calculated identifier error rates for sex, date of birth and postcode and used multi-level logistic regression to investigate associations with individual-level attributes (age, ethnicity and gender) and organisational variation. We then derived: i) weights incorporating dependence between identifiers; ii) attribute-specific weights (varying by age, ethnicity and gender); and iii) organisation-specific weights (by hospital). Results were compared with traditional match weights using a simulation study.

RESULTS Identifier errors (where values disagreed in linked HES-PDS records) or missing values were found in 0.11% of records for sex and date of birth and in 53% of records for postcode. Identifier error rates differed significantly by age, ethnicity and sex (p < 0.0005). Errors were less frequent in males, in 5-6 year olds and 18-19 year olds compared with infants, and were lowest for the Asian ethnic group. A simulation study demonstrated that substantial bias was introduced into estimated readmission rates in the presence of identifier errors. Attribute- and organisation-specific weights reduced this bias compared with weights estimated using traditional probabilistic matching algorithms.

CONCLUSIONS We provide empirical evidence on variation in rates of identifier error in a widely used administrative data source and propose a new method for deriving match weights that incorporates additional data attributes. Our results demonstrate that incorporating information on variation by individual-level characteristics can help to reduce bias due to linkage error.
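The idea of attribute-specific match weights can be illustrated with a minimal sketch. It assumes the standard log2(m/u) form of probabilistic match weights, where m is the probability that an identifier agrees on a true match and u is the probability that it agrees on a non-match. The abstract does not give the authors' estimation procedure, so the function names and every probability below are hypothetical values chosen for illustration, not results from the study.

```python
import math


def match_weights(m, u):
    """Return (agreement, disagreement) log2 weights for one identifier.

    m = P(identifier agrees | records belong to the same person)
    u = P(identifier agrees | records belong to different people)
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree


def record_pair_weight(agreement, m_probs, u_probs):
    """Sum identifier weights for one candidate record pair.

    agreement maps identifier name -> True (agrees) or False (disagrees).
    Summing weights assumes independence between identifiers.
    """
    total = 0.0
    for identifier, agrees in agreement.items():
        agree_w, disagree_w = match_weights(m_probs[identifier], u_probs[identifier])
        total += agree_w if agrees else disagree_w
    return total


# Hypothetical u-probabilities, shared across groups.
U_PROBS = {"sex": 0.5, "dob": 0.001, "postcode": 0.0005}

# Hypothetical m-probabilities by age group: a group with more identifier
# errors gets lower m values (m is roughly 1 minus the error rate).
M_PROBS_BY_GROUP = {
    "infant": {"sex": 0.99, "dob": 0.98, "postcode": 0.90},
    "adult": {"sex": 0.999, "dob": 0.995, "postcode": 0.95},
}

# A candidate pair that agrees on sex and date of birth but not postcode.
pair = {"sex": True, "dob": True, "postcode": False}

for group, m_probs in M_PROBS_BY_GROUP.items():
    print(group, round(record_pair_weight(pair, m_probs, U_PROBS), 2))
```

In this sketch a postcode disagreement is penalised less in the group with the higher assumed error rate, so a true match in that group is less likely to fall below a linkage threshold. Weights that incorporate dependence between identifiers would require a joint model and are not covered here.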

Similar resources

An efficient record linkage scheme using graphical analysis for identifier error detection

BACKGROUND Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone. METHODS We describ...

Error Estimation in Linking Heterogeneous Data Sources

BACKGROUND Record linkage is the process of bringing together related records that have been compiled separately [1]. Many types of studies have been conducted that have used different methods and approaches to link medical records obtained from heterogeneous data sources. For example, the use of administrative data for research purposes has led to considerable interest in computerized methods ...

Evaluation of record linkage of two large administrative databases in a middle income country: stillbirths and notifications of dengue during pregnancy in Brazil

BACKGROUND Due to the increasing availability of individual-level information across different electronic datasets, record linkage has become an efficient and important research tool. High quality linkage is essential for producing robust results. The objective of this study was to describe the process of preparing and linking national Brazilian datasets, and to compare the accuracy of differen...

Linkage between neurological registry data and administrative data.

This section of the guideline discusses considerations when planning and designing a registry that will require linkage to administrative data. Administrative data may include hospitalization and surgical or other procedure data; physician billing data; vital statistics data (e.g. births, deaths); prescription and other pharmacy data; long-term care services and admissions; and other data colle...

Probabilistic Linkage of Persian Record with Missing Data

Extended Abstract. When the comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets would lead to information loss available in other data sets. Hence, it is necessary to integrate scattered information to a comprehensive unique data set. On the other hand, sometimes we are interested in recognition of duplications in a data set. The i...

Journal title:

Volume 17, Issue

Pages -

Publication date: 2017
